from sklearn.metrics import silhouette_samples, silhouette_score
from pandas.plotting import scatter_matrix
from IPython.display import Image
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import seaborn as sns
import re
#from mpl_toolkits.basemap import Basemap
import pandas as pd
import numpy as np
import math
from IPython.display import Markdown
from collections import Counter
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.manifold import TSNE
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
It is no secret that the objective of every business is to grow, whether that growth is measured in sales or in market share. Only recently did we hear that Apple has hit a company valuation of one trillion dollars. One colossal driver of that success is the behind-the-scenes market analytics that studies the consumers willing to spend thousands of dollars on tech products. This is why consumer data is probably one of the most sought-after and inaccessible categories of data in the world.
Insights into your customers = Actionable intel = $
We were lucky to find a dataset which contained two years’ worth of transactions for a UK based online retail store. The company sells “unique all-occasion gifts.” The dataset, “E-Commerce Data,” was downloaded from Kaggle: https://www.kaggle.com/carrie1/ecommerce-data
For the last 50 years, one extremely powerful method of understanding a business’s market or customer base has been market segmentation. Companies that market products to the masses or take a “one-size-fits-all” approach will eventually fail. Market segmentation divides the customer base into groups of individuals that are similar in specific ways. The following tables illustrate the core ways of segmenting customers:
display(Image('https://s3-eu-west-1.amazonaws.com/tutor2u-media/subjects/business/diagrams/marketing-segmentation-bases1.jpg', width=700, unconfined=True))
display(Image('https://s3-eu-west-1.amazonaws.com/tutor2u-media/subjects/business/diagrams/marketing-segments-intro.jpg', width=700, unconfined=True))
The benefits of this practice are invaluable to the business.
We will follow the protocols outlined in the BLT section 10.8 – Clustering Evaluation to measure the effectiveness of our algorithms/methods. This includes cluster validity estimation, which is a measure of the goodness of the created clusters. Although such a measure is inherently subjective, we can still determine the clustering tendency of the data set, and we can compare the results of two different sets of cluster analysis. We will also perform a visual inspection and compute a similarity matrix. Finally, we will examine the silhouette score as well.
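As a minimal, self-contained illustration of the silhouette measure (on synthetic blobs, not our retail data), two well-separated clusters score close to 1:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two tight, well-separated synthetic clusters: silhouette should be high.
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.4, random_state=42)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
print(round(silhouette_score(X, labels), 3))
```

Scores near 0 would instead indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.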
As discussed previously, our chosen validation methods were suggested by Professor Larson and are generally accepted as industry standard. We’ve also determined that, for the stakeholder, if the clustering conclusions result in actionable insights the business can capitalize on, then we have met their needs.
Table Sources: https://www.tutor2u.net/business/reference/market-segmentation
There are 541,909 records in our online marketing data set. This is a transactional data set containing all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
Following are the factor names and descriptions:
| Factor Name | Type | Description |
|---|---|---|
| InvoiceNo | Nominal | A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation. |
| StockCode | Nominal | A 5-digit integral number uniquely assigned to each distinct product. |
| Description | Nominal | Product (item) name. |
| Quantity | Numeric | The quantity of each product (item) per transaction. |
| InvoiceDate | Numeric | The day and time when each transaction was generated. |
| UnitPrice | Numeric | Product price per unit in sterling. |
| CustomerID | Nominal | A 5-digit integral number uniquely assigned to each customer. |
| Country | Nominal | The name of the country where each customer resides. |
# Import Online marketing data
marketing_data = pd.read_excel('./Data/Online Retail.xlsx')
marketing_data.head()
# Data type information for each attribute:
marketing_data.info()
# Find the number of missing values in each column:
print(marketing_data.isnull().sum(axis=0))
na_cols = marketing_data.loc[:, marketing_data.isnull().mean() > 0].columns
for col in na_cols:
    print(col, ' column is ', round(marketing_data[col].isnull().mean(), 5)*100, '% null')
plt.figure(figsize=(16,10))
plt.xlim(0,1)
plt.suptitle('Percentage of null values per column', size=20)
marketing_data.isnull().mean().plot.barh();
plt.show()
When looking at the number of missing values in each attribute, we found 135,080 records, almost 25% of our data set, where CustomerID was null. Part of our research is based on identifying high-value customers, and creating a unique CustomerID for each country where CustomerID is null would manufacture a single, highly valuable customer per country who most likely does not exist. We therefore decided to remove the records where CustomerID was null for this part of the research. However, our research also develops Association Rules, and that work benefits from retaining as much data as possible. Our final decision was to use one data set with the null-CustomerID rows removed wherever CustomerID is an important feature, and a second data set retaining all rows, with null CustomerIDs replaced by a unique number for each country.
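A minimal sketch of the second data set's rule, on hypothetical rows (the 100000+ placeholder IDs are our own convention, chosen to sit outside the real 5-digit CustomerID range):

```python
import pandas as pd
import numpy as np

# Hypothetical rows mimicking the retail data's shape.
df = pd.DataFrame({
    'Country': ['United Kingdom', 'France', 'United Kingdom', 'EIRE'],
    'CustomerID': [17850.0, np.nan, np.nan, 12583.0],
})
# Map each country to one synthetic placeholder ID.
placeholders = {c: 100000 + i for i, c in enumerate(sorted(df['Country'].unique()))}
mask = df['CustomerID'].isna()
df.loc[mask, 'CustomerID'] = df.loc[mask, 'Country'].map(placeholders)
print(df)
```

This preserves every transaction while keeping the synthetic customers easy to identify and exclude later.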
# Make the duplicate of the data set
marketing_data_copy = marketing_data.copy()
# Remove the rows with missing in CustomerID column:
marketing_data.dropna(subset=['CustomerID'], inplace=True)
# Find the first and last orders in our dataset:
marketing_data['InvoiceDate'].min()
marketing_data['InvoiceDate'].max()
This dataset includes orders placed between Dec 2010 and Dec 2011.
marketing_data.shape
# Now we have both datasets with no null values
marketing_data.isnull().sum(axis=0)
Now, checking to identify duplicates in our dataset, we found 5,225 duplicate rows and removed them.
# Find and remove the duplicate records
display('Duplicated records: {}'.format(marketing_data.duplicated().sum()))
marketing_data.drop_duplicates(inplace = True)
marketing_data.shape
We are replacing all descriptions that are not strings or contain no letters.
for ind, el in marketing_data['Description'].items():
    if type(el) != str:
        marketing_data.loc[ind, 'Description'] = 'Unknown'
    elif re.match('.*[a-zA-Z]+.*', el) is None:
        marketing_data.loc[ind, 'Description'] = 'Unknown'
$\textbf{New Features}$
Our research required that we create new features to help us better understand the data and perform some of the analysis we were interested in. These are the new features:
| Feature | Description |
|---|---|
| TotalPrice | The total price of each line item (Quantity * UnitPrice) |
| Recency | The number of days since the last purchase, for each invoice and customer |
| Frequency | Total number of purchases for each customer (imputed in the group-by-customer section) |
| Monetary | Total money each customer spent (imputed in the group-by-customer section) |
# Create the TotalPrice feature
marketing_data['TotalPrice'] = marketing_data['Quantity']*marketing_data['UnitPrice']
# Create Recency Feature for each invoice:
Current=marketing_data['InvoiceDate'].max()
marketing_data['Recency']=(Current-marketing_data['InvoiceDate']).astype('timedelta64[D]')
marketing_data['Recency']=marketing_data['Recency'].astype(float)
Here the Recency column starts out with the timedelta64[ns] data type. For easier use in our analysis, we extract the days as a numeric value representing the number of days between each transaction and the maximum date in our dataset.
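Equivalently, the integer day count can be taken from the timedelta's `.dt.days` accessor; a small sketch on made-up dates:

```python
import pandas as pd

# Hypothetical invoice dates; the most recent one plays the role of "Current".
dates = pd.to_datetime(pd.Series(['2011-12-01', '2011-11-15', '2011-12-09']))
current = dates.max()
recency_days = (current - dates).dt.days  # whole days since each date
print(recency_days.tolist())  # [8, 24, 0]
```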
marketing_data.head()
marketing_data.info()
$\textbf{Examine Data Grouped by Country}$
Create a data frame that shows the number of unique features in each country. We are looking for any interesting patterns based on which countries orders come from. We want to see which specific countries buy the most items at one time, buy the most expensive items, and spend the most money on any single item. This information may be actionable by the business.
per_country = marketing_data.groupby(['Country']).nunique()
per_country= per_country.drop('Country',axis=1)
per_country.reset_index(level=0, inplace=True)
per_country[['Country','InvoiceNo','StockCode','Description','Quantity','InvoiceDate','UnitPrice','CustomerID']]
The chart below shows the number of unique customer ID's in each country.
per_country.sort_values(by=['CustomerID'],ascending=False,inplace=True)
num_cust = per_country['CustomerID']
country = per_country['Country']
plt.figure(figsize=(16,10))
plt.title('Number of Customers per Country')
plt.xlabel('Number of Customers')
sns.barplot(x=num_cust, y=country, orient='h')
plt.show()
display(per_country[['CustomerID','Country']])
The United Kingdom dominates our customer list. In fact, over 90% of customers are from the United Kingdom.
cust_UK_row = per_country.loc[per_country['Country'] =='United Kingdom']
cust_UK = cust_UK_row['CustomerID'].sum()
cust_ROW_row = per_country.loc[per_country['Country']!='United Kingdom']
cust_ROW = cust_ROW_row['CustomerID'].sum()
display('{0:.2%} of customers are in the United Kingdom'.format((cust_UK/(cust_UK+cust_ROW))))
labels = 'UK Customers','ROW Customers'
cust_count = [cust_UK, cust_ROW]
colors = ['#ff9999','#66b3ff']
explode = [0,0.2]
plt.axis('equal')
plt.pie(cust_count, labels=labels, explode=explode, colors=colors, autopct='%1.2f%%', shadow=True, startangle=0)
plt.show()
count_by_country = marketing_data.groupby(['Country']).count()
count_by_country['Avg_UnitPrice'] = marketing_data.groupby(['Country'])['UnitPrice'].mean()
count_by_country['Avg_Quantity'] = marketing_data.groupby(['Country'])['Quantity'].mean()
count_by_country['Avg_TotalPrice'] = marketing_data.groupby(['Country'])['TotalPrice'].mean()
count_by_country.reset_index(level=0, inplace=True)
count_by_country.sort_values(by=['CustomerID'],ascending=False,inplace=True)
The chart below shows which country buys the most of any single item on average.
count_by_country.sort_values(by=['Avg_Quantity'],ascending=False,inplace=True)
avg_quantity = count_by_country['Avg_Quantity']
country = count_by_country['Country']
plt.figure(figsize=(16,10))
plt.title('Average Items per Line Item by Country')
plt.xlabel('Average Quantity per Line Item')
sns.barplot(x=avg_quantity, y=country, orient='h')
plt.show()
display(count_by_country[['Avg_Quantity', 'Country']])
The chart below shows which country buys the most expensive items on average.
count_by_country.sort_values(by=['Avg_UnitPrice'],ascending=False,inplace=True)
avg_price = count_by_country['Avg_UnitPrice']
country = count_by_country['Country']
plt.figure(figsize=(16,10))
plt.title('Average Unit Price per Line Item by Country')
plt.xlabel('Average Unit Price (sterling)')
sns.barplot(x=avg_price, y=country, orient='h')
plt.show()
display(count_by_country[['Avg_UnitPrice', 'Country']])
Singapore appears to be an outlier here, buying a single item at a much higher price than any other country. The table below explains this: there were four manual charges, apparently to four consecutive customers, for some high-priced item, and then the invoices were cancelled. Since we decided earlier to leave cancelled items in the data set, we will leave these as well and simply note the outlier.
display (marketing_data.loc[(marketing_data['Country']=='Singapore')])
The table below shows which country has the highest average total line-item price. The ordering of this list is similar to the average number of items per line item table above; it makes sense that the countries that buy the most of a single item also spend the most money on a single line item. Note that the Singapore outlier found above may affect its standing in this chart.
count_by_country.sort_values(by=['Avg_TotalPrice'],ascending=False,inplace=True)
avg_sale = count_by_country['Avg_TotalPrice']
country = count_by_country['Country']
plt.figure(figsize=(16,10))
plt.title('Average Total Price per Line Item by Country')
plt.xlabel('Average Total Price per Line Item (sterling)')
sns.barplot(x=avg_sale, y=country, orient='h')
plt.show()
display(count_by_country[['Avg_TotalPrice', 'Country']])
Here we examine each transaction's Total Price by country, looking for outliers. Since Total Price is a combination of Quantity and Unit Price, we will be able to check both of these features with this chart.
marketing_data['TotalPrice'] = marketing_data['UnitPrice'] * marketing_data['Quantity']
plt.figure(figsize=(16,10))
sns.boxplot(x='TotalPrice', y='Country', data=marketing_data, orient='h')
display (marketing_data.describe())
The graph above is not very granular; however, we can see some outliers from the United Kingdom. The outliers appear roughly symmetric, so we need to examine them. They all fall above 35,000 or below -35,000. Below we look for the transactions that meet this criterion.
min_sales = marketing_data.loc[(marketing_data['TotalPrice'] < -35000)]
max_sales = marketing_data.loc[(marketing_data['TotalPrice'] > 35000)]
display(min_sales)
display(max_sales)
The invoice numbers for all the high-value negative transactions begin with a 'C'; we can speculate these are cancelled orders. The stock codes for two of the three matching transactions also match, which is further evidence that these are not mistakes but actual orders that were cancelled. We will retain them in the data set.
Here we want to collect all the relevant information for each unique invoice. Invoices are spread all over the data set since each row represents a single line item in a unique invoice.
This data set groups by InvoiceNo which results in having one row representing one unique invoice, total transaction value of that invoice, customer, country, invoice date, total quantity and descriptions plus all other information related to the invoice.
by_invoice = marketing_data.groupby(['InvoiceNo', 'InvoiceDate','Country','CustomerID', 'Recency'], as_index=False)[ 'UnitPrice','Description','Quantity','TotalPrice'].agg(lambda x: list(x))
by_invoice['TotalPrice'] = by_invoice['TotalPrice'].apply(sum)
by_invoice['QuantityTotal'] = by_invoice['Quantity'].apply(sum)
by_invoice['Description_as_string'] = by_invoice['Description'].apply(lambda x: "-".join(sorted(x)))
by_invoice.head()
The chart below shows which country spends the most money on average on each invoice. It is not surprising to see the Netherlands leading this chart, as it was at the top of both the Average Price per Line Item and the Average Number of Items per Line Item lists.
invoice_by_country = by_invoice.groupby(['Country']).count()
invoice_by_country['Avg_TotalPrice'] = by_invoice.groupby(['Country'])['TotalPrice'].mean()
invoice_by_country.reset_index(level=0, inplace=True)
invoice_by_country.sort_values(by=['Avg_TotalPrice'],ascending=False,inplace=True)
avg_invoice = invoice_by_country['Avg_TotalPrice']
country = invoice_by_country['Country']
plt.figure(figsize=(16,10))
plt.title('Average Total Invoice per Country')
plt.xlabel('Average Invoice Total (sterling)')
sns.barplot(x=avg_invoice, y=country, orient='h')
plt.show()
display(invoice_by_country[['Avg_TotalPrice', 'Country']])
Our research here is centered on customers, so it is useful to group the dataset by CustomerID and, from there, create two new features (Frequency and Monetary) for each customer.
by_customer = by_invoice.groupby(['CustomerID'], as_index=False)['InvoiceNo', 'InvoiceDate', 'Recency', 'Description','Quantity','TotalPrice'].agg(lambda x: list(x))
by_customer['Frequency'] = by_customer['InvoiceNo'].apply(lambda x: len(x))
by_customer['Monetary'] = by_customer['TotalPrice'].apply(sum)
by_customer['Recency'] = by_customer['Recency'].apply(min)
by_customer['Quantity'] = by_customer['Quantity'].apply(lambda x: sum(x,[]))
by_customer['QuantityTotal'] = by_customer['Quantity'].apply(sum)
#by_customer['Description_joined'] = by_customer['Description'].apply(lambda x: "-".join(sorted(x[0])))
by_customer.head()
Next, we examine the distribution of the Frequency and Recency features we created. Recency, the time since customers last purchased from us, averages just under 100 days. Frequency, which shows how often people buy, reveals we have quite a few good customers, with at least one customer buying from us 249 times. Further research to identify these high-repeat customers would be worthwhile future work.
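As a sketch of that future work, on a hypothetical miniature of the by_customer table, one could flag the top decile of customers by purchase count:

```python
import pandas as pd

# Hypothetical mini version of the by_customer table.
by_customer_mini = pd.DataFrame({
    'CustomerID': [12346, 12347, 12348, 12349],
    'Frequency': [1, 182, 31, 249],
    'Monetary': [77183.6, 4310.0, 1797.2, 1757.6],
})
# One way to flag high-repeat customers: top decile by purchase count.
cutoff = by_customer_mini['Frequency'].quantile(0.9)
high_repeat = by_customer_mini[by_customer_mini['Frequency'] >= cutoff]
print(high_repeat['CustomerID'].tolist())  # [12349]
```

On the full table the same two lines would surface the real high-repeat customers for targeted follow-up.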
desc_rec = by_customer['Recency'].describe()
desc_freq = by_customer['Frequency'].describe()
list_of_desc = [desc_rec, desc_freq]
fig = plt.figure(figsize=(8,12))
sub1 = fig.add_subplot(2,2,1)
sub1.set_xlabel('Recency')
sub1.set_ylabel('Days')
sub1.boxplot(by_customer['Recency'],showmeans=True,)
sub2 = fig.add_subplot(2,2,2)
sub2.set_xlabel('Frequency')
sub2.set_ylabel('Number of Purchases')
sub2.boxplot(by_customer['Frequency'], showmeans=True)
plt.show()
display(desc_rec)
display(desc_freq)
by_invoice['Transaction'] = ''
for index, row in by_invoice.iterrows():
    if str(row['InvoiceNo']).startswith("C"):
        by_invoice.loc[index, 'Transaction'] = 'Cancel'
    elif str(row['InvoiceNo']).startswith("A"):
        by_invoice.loc[index, 'Transaction'] = 'Adjust'
    else:
        by_invoice.loc[index, 'Transaction'] = 'Purchase'
by_invoice.sample(10)
Now we create a word cloud from the Description feature of the data set. This highlights the most common words found across the line-item descriptions of the orders. Alarm, Clock, Bakelike, Red and Children are among the most common words in the description field.
# Code from stackoverflow: https://stackoverflow.com/questions/16645799/how-to-create-a-word-cloud-from-a-corpus-in-python
stopwords = set(STOPWORDS)

def show_wordcloud(data, title=None):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=200,
        max_font_size=40,
        scale=3,
        random_state=1  # chosen at random by flipping a coin; it was heads
    ).generate(str(data))
    fig = plt.figure(1, figsize=(16, 12))
    plt.axis('off')
    if title:
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)
    plt.imshow(wordcloud)
    plt.show()
show_wordcloud(marketing_data['Description'])
What is an RFM model and why use it?
RFM stands for the Recency, Frequency, & Monetary Model, a classic business tool that has benefited the marketing and sales departments of high-customer-volume companies. The basic question those departments pose is: who is your “best” customer? And here lies the problem of determining which customers qualify as “best,” since it’s such a subjective word. Some businesses value loyalty over high sales, believing consistency and predictability serve them best. Others believe only in high-volume sales, even if seasonal, and so on.
The RFM Model is a variation of previous models that attempt to segment the business’s customer base, or potential customer base, quantitatively using a three-variable scoring methodology. This might seem intuitive but, as we discuss in the Deployment Section, it is not implemented as often as we’d think.
Further examination, excerpted from the predictive analytics software company Canopy Labs:
1) Recency: When was the last time a customer made a purchase order with your business? According to the RFM model, a customer who has recently interacted with your store is more inclined to accept another interaction that you initiate.
2) Frequency: How regularly does this customer make a purchase with your business? Over time, the frequency of a purchase can, in most cases, predict the likelihood and schedule of future purchases.
3) Monetary Value: How much money does this customer spend over a period of time? Depending on what makes more sense for your company, you might decide to calculate monetary value as revenue or profitability. Either way, this input adds monetary spend into the equation to calculate your RFM.
Source: https://canopylabs.com/resources/an-introduction-to-the-rfm-model/
Coding Source: https://towardsdatascience.com/find-your-best-customers-with-customer-segmentation-in-python-61d602f9eee6
#Dataset used with new appended fields
#Dataset accounts for cancelled orders.
by_customer.head()
# Duplicated the dataframe to avoid accidental manipulation.
maa = by_customer.copy()
#Isolated the variables that are relevant to the analysis
mad = maa[['CustomerID', 'Recency', 'Frequency', 'Monetary']]
mad.set_index('CustomerID',inplace=True)
mad.head()
#mad.to_excel('C:/Users/Bahr/Desktop/dakas.xlsx') #Testing dataset accuracy
#maa.to_excel('C:/Users/Bahr/Desktop/dakasss.xlsx') #Testing dataset accuracy
#The following table is an RFM table
#Create customer segments from RFM model using Quartiles
#Here we want to create an RFM score for each customer to easily identify the customer's behavior.
#The score is from 1-4 but can be 1-5, etc.
quant = mad.quantile(q=[0.25,0.5,0.75])
quant
quant = quant.to_dict()
#Here we are creating the RFM segmentation rankings.
#The score is from 1-4 but can be 1-5, etc.
#A rank of 4 will be the highest/best. A rank of 1 will be the worst/lowest.
#We have two arguments because in case of recency, the lower the number (days) the better.
#Arguments (x= value, p = recency, monetary_value, frequency, d = quartiles dict)
def Rscore(x, p, d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]:
        return 2
    else:
        return 1
#Arguments (x= value, p = recency, monetary_value, frequency, d = quartiles dict)
def FMscore(x, p, d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]:
        return 3
    else:
        return 4
#Create RFM segmentation table
segmentation = mad.copy()
segmentation['R_Quartile'] = segmentation['Recency'].apply(Rscore, args=('Recency',quant))
segmentation['F_Quartile'] = segmentation['Frequency'].apply(FMscore, args=('Frequency',quant))
segmentation['M_Quartile'] = segmentation['Monetary'].apply(FMscore, args=('Monetary',quant))
print(segmentation[segmentation['Frequency'].isna()])
Now each customer is assigned to a quartile for each feature.
segmentation.head()
#adding RFM score for each customer
segmentation['RFM_Score'] = segmentation.R_Quartile.map(str) \
+ segmentation.F_Quartile.map(str) \
+ segmentation.M_Quartile.map(str)
segmentation.head()
print(segmentation[segmentation['Frequency'].isna()])
#Filtering for the "best" Customers.
segmentation[segmentation['RFM_Score']=='444'].head(10)
#Here we will assign the scores the appropriate segments and quantify customers that fall in those groups.
display(Image('https://i.imgur.com/vEUDYTc.png', width=700, unconfined=True))
#Source: https://www.blastam.com/blog/rfm-analysis-boosts-sales
Earlier we suggested that a score of 4 was the best for each feature. That is because some analysts add the scores of the three features and rank customers on the total: a total score of 12, for example, would rank a customer highest. This simplification is made for salespeople who don’t want to drill down into the data. In the chart above, the best customers have the lowest total score, but in that case the purpose wasn’t to add up the scores; the intention was to use the 3-digit representation itself as a score. We will stick with "444" as representing the best customer.
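For comparison, the additive variant mentioned above can be sketched like this (the quartile ranks are hypothetical):

```python
import pandas as pd

# Hypothetical quartile ranks for three customers.
seg = pd.DataFrame({'R_Quartile': [4, 2, 1],
                    'F_Quartile': [4, 3, 1],
                    'M_Quartile': [4, 3, 2]})
# The additive variant some analysts use: 12 is best, 3 is worst.
seg['RFM_Total'] = seg[['R_Quartile', 'F_Quartile', 'M_Quartile']].sum(axis=1)
print(seg['RFM_Total'].tolist())  # [12, 8, 4]
```

The 3-digit string we use instead ("444", "233", ...) keeps the per-feature detail that the sum throws away.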
#Manual assigning of customer groups trial 1.
print("Best Customers: ",len(segmentation[segmentation['RFM_Score']=='444']))
print('Loyal Customers: ',len(segmentation[segmentation['F_Quartile']==4]))
print("Big Spenders: ",len(segmentation[segmentation['M_Quartile']==4]))
print('Almost Lost: ', len(segmentation[segmentation['RFM_Score']=='244']))
print('Lost Customers: ',len(segmentation[segmentation['RFM_Score']=='144']))
print('Lost Cheap Customers: ',len(segmentation[segmentation['RFM_Score']=='111']))
#Manual assigning of customer groups trial 2.
print("Best Customers: ",len(segmentation[segmentation['RFM_Score']=='444']))
print('Growth Customers: ', len(segmentation[segmentation['RFM_Score']=='333']))
print('Tipping point Customers: ',len(segmentation[segmentation['RFM_Score']=='222']))
print('Lost Customers: ',len(segmentation[segmentation['RFM_Score']=='111']))
#Manual assigning of customer groups trial 3.
upper_list = ['444','433','434','333','344','343','443','334']
lower_list = ['222','211','212','111','122','121','112','221']
#len([k for k in temp if segmentation['RFM_Score'] in k])
print("Valued Customers: ",len( segmentation[segmentation['RFM_Score'].isin(upper_list) ]))
print('Normal Customers: ',len(segmentation[segmentation['RFM_Score'].isin(lower_list)]))
#Using K-Means Clustering on RFM Variables
#Create new dataframe to avoid complications.
rfm_data = segmentation.drop(['R_Quartile','F_Quartile','M_Quartile','RFM_Score'],axis=1)
rfm_data.head()
print(rfm_data[rfm_data['Frequency'].isna()])
rfm_data.corr() #Feature Correlations
#From the heatmap and the feature correlation numbers above, we see that Frequency and Monetary are the most correlated.
#We will explore this relationship.
sns.heatmap(rfm_data.corr(),cmap="Greens");
#Visualizing Feature Distributions
#From the visualization, we see that the variables are skewed thus we need to normalize.
scatter_matrix(mad, alpha = 0.3, figsize = (15,8), diagonal = 'kde')
scatter_matrix(rfm_data, alpha = 0.3, figsize = (15,8), diagonal = 'kde');
#Data normalization using log.
#Here we had a slight problem with NAN values. We replaced NAN with Mean.
rfm_r_log = np.log(rfm_data['Recency']+.01)
rfm_f_log = np.log(rfm_data['Frequency']+.01)
rfm_m_log = np.log(rfm_data['Monetary']+.01)
log_data = pd.DataFrame({'Monetary': rfm_m_log, 'Recency': rfm_r_log, 'Frequency': rfm_f_log})
log_data.head()
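One likely source of the NaN values handled next: cancelled orders can drive a customer's Monetary total to zero or below, and the logarithm is undefined for non-positive inputs. A small sketch with hypothetical totals:

```python
import numpy as np

# log of a non-positive number yields NaN (warning suppressed here).
with np.errstate(invalid='ignore'):
    vals = np.log(np.array([-50.0, 0.0, 250.0]) + .01)
print(np.isnan(vals))  # [ True False False]
```

Only the strictly negative total produces NaN here because the +.01 shift rescues the zero; larger negative totals are not rescued.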
#Created another data frame to test NaN and infinity issues.
test = log_data.copy()
test = test.fillna(test.mean())
test.head()
np.any(np.isnan(test))
#Visualized result after normalization. Results look much less skewed.
scatter_matrix(test, alpha = 0.2, figsize = (14,7), diagonal = 'kde');
test.corr()
sns.heatmap(test.corr(),cmap="Greens");
Normalization has made Monetary and Frequency even more correlated.
#K-Means implementation --- the best silhouette score is with 2 clusters
matrix = test.values
for n_clusters in range(2, 10):
    kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=100)
    kmeans.fit(matrix)
    clusters = kmeans.predict(matrix)
    silhouette_avg = silhouette_score(matrix, clusters)
    print("For n_clusters =", n_clusters, "the average silhouette_score is:", silhouette_avg)
n_clusters = 2 # we will use this initially to visualize low value (F) and high value customers
kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=30)
kmeans.fit(matrix)
clusters_customers = kmeans.predict(matrix)
silhouette_avg = silhouette_score(matrix, clusters_customers)
print('silhouette Score: {:<.3f}'.format(silhouette_avg))
#create a scatter plot identifying high value and low value/average customers.
plt.scatter(matrix[:, 0], matrix[:, 2], c=clusters_customers, s=50, cmap='plasma')
#select cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 2], c='black', s=200, alpha=0.5)
plt.xlabel('Monetary', fontsize=15)
plt.ylabel('Frequency', fontsize=15)
plt.show()
# What's the number of customers in each cluster?
pd.DataFrame(pd.Series(clusters_customers).value_counts(), columns = ['Number Customers']).T
Overall the clustering of these two customer segments seems close to expected. Of course we'll see overlap due to the dynamic behavioral nature of customers.
Next we will plot 6 customer groups to see what it looks like.
n_clusters = 6 # 6 Clusters for potentially the 6 customer groups we've assigned earlier.
kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=30)
kmeans.fit(matrix)
clusters_customers = kmeans.predict(matrix)
silhouette_avg = silhouette_score(matrix, clusters_customers)
print('silhouette score: {:<.3f}'.format(silhouette_avg))
#create a scatter plot identifying customers in increasing value.
plt.scatter(matrix[:, 0], matrix[:, 2], c=clusters_customers, s=50, cmap='plasma')
#select cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 2], c='black', s=200, alpha=0.5)
plt.xlabel('Monetary', fontsize=15)
plt.ylabel('Frequency', fontsize=15)
plt.show()
# What's the number of customers in each cluster?
pd.DataFrame(pd.Series(clusters_customers).value_counts(), columns = ['Number Customers']).T
n_clusters = 4 # next, examine four clusters
kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=30)
kmeans.fit(matrix)
clusters_customers = kmeans.predict(matrix)
silhouette_avg = silhouette_score(matrix, clusters_customers)
print('silhouette Score: {:<.3f}'.format(silhouette_avg))
#create a scatter plot identifying customers in increasing value.
plt.scatter(matrix[:, 0], matrix[:, 2], c=clusters_customers, s=50, cmap='plasma')
#select cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 2], c='black', s=200, alpha=0.5)
plt.xlabel('Monetary', fontsize=15)
plt.ylabel('Frequency', fontsize=15)
plt.show()
# What's the number of customers in each cluster?
pd.DataFrame(pd.Series(clusters_customers).value_counts(), columns = ['Number Customers']).T
import plotly
import plotly.graph_objs as go
def getTrace(x, y, z, c, label, s=2):
    trace_points = go.Scatter3d(
        x=x, y=y, z=z,
        mode='markers',
        marker=dict(size=s, line=dict(color='rgb(0, 0, 0)', width=0.5), color=c, opacity=1),
        name=label)
    return trace_points
def showGraph(title, x_colname, x_range, y_colname, y_range, z_colname, z_range, traces):
    layout = go.Layout(
        title=title,
        scene=dict(
            xaxis=dict(title=x_colname, range=x_range),
            yaxis=dict(title=y_colname, range=y_range),
            zaxis=dict(title=z_colname, range=z_range)
        )
    )
    fig = go.Figure(data=traces, layout=layout)
    plotly.offline.plot(fig)
#Coding Source: http://www.semspirit.com/artificial-intelligence/machine-learning/clustering/k-means-clustering/k-means-clustering-in-python/
from sklearn.cluster import KMeans
n_opt_clusters = 2
kmeans = KMeans( init = 'k-means++', n_clusters = n_opt_clusters, max_iter=1000, n_init = 30, random_state=0)
kmeans.fit(matrix)
y_kmeans = kmeans.predict(matrix)
#visualising the clusters
centroids = getTrace(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:, 2], s= 25, c = 'yellow', label='Centroids')
t1 = getTrace(matrix[y_kmeans == 0, 0], matrix[y_kmeans == 0, 1], matrix[y_kmeans == 0, 2], s= 5, c='red', label = '0: Valued Customers')
t2 = getTrace(matrix[y_kmeans == 1, 0], matrix[y_kmeans == 1, 1], matrix[y_kmeans == 1, 2], s= 5, c='pink', label = '1: Average Customers')
t3 = getTrace(matrix[y_kmeans == 2, 0], matrix[y_kmeans == 2, 1], matrix[y_kmeans == 2, 2], s= 5, c='blue', label = '2: ')
t4 = getTrace(matrix[y_kmeans == 3, 0], matrix[y_kmeans == 3, 1], matrix[y_kmeans == 3, 2], s= 5, c='green', label = '3: ')
t5 = getTrace(matrix[y_kmeans == 4, 0], matrix[y_kmeans == 4, 1], matrix[y_kmeans == 4, 2], s= 5, c='cyan', label = '4: ')
x=matrix[:,0]
y=matrix[:,1]
z=matrix[:,2]
showGraph("Two Customer Segmentations", "Monetary", [min(x),max(x)], "Frequency", [min(y),max(y)], "Recency", [min(z)-1,max(z)], [t1,t2,t3,t4, t5, centroids])
Since the output is an HTML file, we include screenshots of the plots. We will visualize segmentations based on 2, 3, and 5 centroids. (A previous 2D plot suggested 6 centroids/segments, but we think that discussing a segment of only 15 customers is not worth the squeeze.)
display(Image('https://i.imgur.com/PORUzFF.jpg', width=950, unconfined=True))
# What's the number of customers in each cluster?
pd.DataFrame(pd.Series(y_kmeans).value_counts(), columns = ['Number Customers']).T
Ramifications: Having a subject matter expert in the group is definitely an advantage in analyzing these clusters. We started out with 2D plots of K-Means but moved to a three-dimensional look at the clustering output on the suspicion that it would greatly increase insight. And it did. Here we produced two clusters, labeled Valued Customers and Average Customers. We chose two clusters because that configuration had the highest Silhouette Score. However, as we can see, there is a group on the right that visually doesn’t look like it belongs.
Regardless, the counts generated above for the first and second groups are not exactly in line with our manual RFM assignment in Trial 3. Some variability is expected in that comparison, since Trial 3 used a hard cutoff between the Valued customers and the Average customers. The manual assignment in Trial 3 therefore omitted the spectrum of customers in between, which is strategic for identifying the strongest customers, but only that. We feel, however, that two clusters do not provide as much insight as we’d like, so we will increase the number of clusters moving forward.
n_opt_clusters = 3
kmeans = KMeans( init = 'k-means++', n_clusters = n_opt_clusters, max_iter=1000, n_init = 30, random_state=0)
kmeans.fit(matrix)
y_kmeans = kmeans.predict(matrix)
#visualising the clusters
centroids = getTrace(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:, 2], s= 25, c = 'yellow', label='Centroids')
t1 = getTrace(matrix[y_kmeans == 0, 0], matrix[y_kmeans == 0, 1], matrix[y_kmeans == 0, 2], s= 5, c='red', label = '0: Valued Customers')
t2 = getTrace(matrix[y_kmeans == 1, 0], matrix[y_kmeans == 1, 1], matrix[y_kmeans == 1, 2], s= 5, c='pink', label = '1: Average Customers')
t3 = getTrace(matrix[y_kmeans == 2, 0], matrix[y_kmeans == 2, 1], matrix[y_kmeans == 2, 2], s= 5, c='blue', label = '2: Irregular Customers')
t4 = getTrace(matrix[y_kmeans == 3, 0], matrix[y_kmeans == 3, 1], matrix[y_kmeans == 3, 2], s= 5, c='green', label = '3: ')
t5 = getTrace(matrix[y_kmeans == 4, 0], matrix[y_kmeans == 4, 1], matrix[y_kmeans == 4, 2], s= 5, c='cyan', label = '4: ')
x=matrix[:,0]
y=matrix[:,1]
z=matrix[:,2]
showGraph("Three Customer Segmentations", "Monetary", [min(x),max(x)], "Frequency", [min(y),max(y)], "Recency", [min(z)-1,max(z)], [t1,t2,t3,t4, t5, centroids])
display(Image('https://i.imgur.com/H4E6SfS.jpg', width=950, unconfined=True))
# What's the number of customers in each cluster?
pd.DataFrame(pd.Series(y_kmeans).value_counts(), columns = ['Number Customers']).T
Ramifications: The Silhouette score for the 3-cluster solution was only .01 below that of the 2-cluster solution, so we’ll treat this result as effectively the best clustering output. We finally see a distinction in that third cluster on the side that shouldn’t belong with the two central clusters; we’ve labeled it the Irregular customers group. That group is tricky to pin down since it’s elongated and spans everything from high monetary and recency to low monetary and low recency, while frequency remains roughly constant. This explains why, in the 2D plots we will see later, the monetary vs. frequency plots look much cleaner than the monetary vs. recency plots: there is far more overlap in recency, since customer behavior is a spectrum.
n_opt_clusters = 5
kmeans = KMeans( init = 'k-means++', n_clusters = n_opt_clusters, max_iter=1000, n_init = 30, random_state=0)
kmeans.fit(matrix)
y_kmeans = kmeans.predict(matrix)
#visualising the clusters
centroids = getTrace(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:, 2], s= 25, c = 'yellow', label='Centroids')
t1 = getTrace(matrix[y_kmeans == 0, 0], matrix[y_kmeans == 0, 1], matrix[y_kmeans == 0, 2], s= 5, c='red', label = '0: Growth Customers')
t2 = getTrace(matrix[y_kmeans == 1, 0], matrix[y_kmeans == 1, 1], matrix[y_kmeans == 1, 2], s= 5, c='pink', label = '1: Tipping Point Customers')
t3 = getTrace(matrix[y_kmeans == 2, 0], matrix[y_kmeans == 2, 1], matrix[y_kmeans == 2, 2], s= 5, c='blue', label = '2: Best Customers')
t4 = getTrace(matrix[y_kmeans == 3, 0], matrix[y_kmeans == 3, 1], matrix[y_kmeans == 3, 2], s= 5, c='green', label = '3: Irregular Customers')
t5 = getTrace(matrix[y_kmeans == 4, 0], matrix[y_kmeans == 4, 1], matrix[y_kmeans == 4, 2], s= 5, c='cyan', label = '4: Loyal Customers')
x=matrix[:,0]
y=matrix[:,1]
z=matrix[:,2]
showGraph("Five Customer Segmentations", "Monetary", [min(x),max(x)], "Frequency", [min(y),max(y)], "Recency", [min(z)-1,max(z)], [t1,t2,t3,t4, t5, centroids])
display(Image('https://i.imgur.com/nJyoWo2.jpg', width=950, unconfined=True))
display(Image('https://i.imgur.com/2OChQPZ.jpg', width=950, unconfined=True))
# What's the number of customers in each cluster?
pd.DataFrame(pd.Series(y_kmeans).value_counts(), columns = ['Number Customers']).T
Ramifications: Our subject matter expert agrees that visually this is the best clustering graph yet. We tried 6 and 7 clusters but the insights derived from those reached a plateau. The customer groups we have here are the following:
Clustering output
Red: Growth Customers: 1455
Pink: Tipping Point Customers: 824
Blue: Best Customers: 767
Green: Irregular Customers: 105
Cyan: Loyal Customers: 1221
Looking at the first “Five Customer Segmentation” plot, we can clearly see that the Blue group, the Best Customers, has the highest monetary, recency, and frequency values (frequency can be seen better in the second screenshot). The Cyan group, the Loyal Customers, earned that label because, while they might not buy the most or the most recently, they are the highest group overall in terms of frequency; visually, they sit closer to us on the z-axis. Next we have the Red group, the Growth Customers, labeled that way because, unlike the Loyal Customers, they lag in monetary and recency, but only just behind them. This is a big opportunity for the sales team: the middle customer segment that is often ignored in business.
Next, we look at the second screenshot for the “Five Customer Segmentation.” Here we can see the frequency axis much better, and we immediately notice that the Green group, as mentioned previously, is the worst in terms of purchase frequency. However, its customers are clustered in an elongated shape that spans everything from big spenders to small spenders. Rightfully, they are the Irregular customer group, and, according to the 80/20 rule, not worth the squeeze.
Looking back at our RFM customer segmentation assignments in Trials 1 and 2, we feel this clustering outperformed them due to its inclusive, spectral treatment of customer behavior.
Here, we perform a different type of clustering: Hierarchical Agglomerative Clustering (HAC). The algorithm initially treats each point as its own cluster, then repeatedly merges the two closest clusters into a new one. How pairs merge depends on how dissimilarity is computed between a merged pair and the remaining samples; there are many ways to do this, and we used the "complete", "ward", and "average" linkage methods. The merging process continues until all items belong to a single cluster.
Here is the definition of each method used:
The distance metric used in our clustering is Euclidean distance.
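As a quick illustration, the sketch below (toy 2-D points, purely hypothetical, not our RFM matrix) runs the three linkage criteria over the same data with SciPy and cuts each resulting tree into two clusters:

```python
# Toy comparison of the three linkage criteria used above; each builds a
# merge tree over the same six points, which we then cut into 2 clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

pts = np.array([[0, 0], [0, 1], [1, 0],          # one tight group
                [10, 10], [10, 11], [11, 10]])   # a second tight group

for method in ['ward', 'complete', 'average']:
    Z = linkage(pts, method=method, metric='euclidean')
    labels = fcluster(Z, t=2, criterion='maxclust')
    print(method, labels)  # all three recover the two obvious groups
```

On such well-separated data the linkage choice makes no difference; on the real customer data the methods diverge, which is why we compare all three below.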
We used the average silhouette approach to choose the number of clusters. Briefly, it measures the quality of a clustering: how well each object lies within its cluster. A high average silhouette width indicates a good clustering. Silhouette scores range from -1 to 1, with 1 being the best.
The average silhouette method computes the mean silhouette of the observations for different values of n. The optimal number of clusters n is the one that maximizes the average silhouette over the range of candidate values (Kaufman and Rousseeuw, 1990).
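To make the metric concrete, here is a tiny hand computation of the silhouette value s(i) = (b(i) - a(i)) / max(a(i), b(i)) on four hypothetical 1-D points, checked against scikit-learn:

```python
# Hand-check the silhouette definition: a(i) is the mean distance from point i
# to the rest of its own cluster, b(i) the mean distance to the nearest other
# cluster. We compare a manual computation with sklearn's silhouette_samples.
import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0], [1.0], [5.0], [6.0]])
labels = np.array([0, 0, 1, 1])

# Manual computation for the first point (x = 0):
a = 1.0                    # mean distance to its own cluster ({1.0})
b = (5.0 + 6.0) / 2        # mean distance to the other cluster
s0 = (b - a) / max(a, b)   # (5.5 - 1.0) / 5.5 ~ 0.818

assert np.isclose(silhouette_samples(X, labels)[0], s0)
```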
%%time
matrix = test.values
params = []
for link in ['ward', 'complete', 'average']:
for n in range(2,20):
# append on the clustering
hc = AgglomerativeClustering(n_clusters=n, affinity='euclidean', linkage=link)
hc.fit(matrix)
newfeature_hc = hc.labels_
hc_silhouette = silhouette_score(matrix, newfeature_hc)
#print(hc.labels_)
print("C=", n, "Link type=", link , "Avg Silhouette_score is:", hc_silhouette)
Here, the maximum Silhouette scores belong to the average linkage method: 0.75 for 2 clusters. The scores for n = 3 and 4 are fairly close to that optimum, at .68.
It is worth refining our HAC by looking at the dendrogram, which helps find the optimal number of clusters.
For the dendrogram we used the ward linkage method, which produces the most readable graph. (We tried the other methods as well and ultimately decided to continue with ward linkage.)
# Create Dendrogram graph:
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(25,15))
plt.title("Customer Dendrograms")
dend = shc.dendrogram(shc.linkage(matrix, method='ward'))
Looking at the dendrogram, we can pick a distance where the branches are long and draw a cutoff there. No cut is inherently wrong, and there is no definitive procedure for choosing one; several cutoffs might be equally good from a theoretical point of view. Given this flexibility, we will rely on visual cues, plotting the data at different cluster counts to see which provides the most insight.
Cutting at distances of 60, 80, and 120 yields 4, 3, and 2 clusters, respectively. Next, we will inspect these candidate cluster counts visually with scatter plots.
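The idea of converting a cut height into flat cluster labels can be sketched as follows (on synthetic 3-D data, not our actual matrix; on the real dendrogram the analogous cuts are 120, 80, and 60):

```python
# Cut a ward dendrogram at chosen heights and count the resulting clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(c, 0.5, size=(50, 3)) for c in (0, 5, 10)])

Z = linkage(X, method='ward')
heights = Z[:, 2]  # merge distances; ward linkage keeps them increasing

# Cutting between the two highest merges leaves 2 clusters; one merge lower, 3.
cut2 = (heights[-1] + heights[-2]) / 2
cut3 = (heights[-2] + heights[-3]) / 2
print(fcluster(Z, t=cut2, criterion='distance').max())  # 2
print(fcluster(Z, t=cut3, criterion='distance').max())  # 3
```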
Here, we visually inspect scatter plots for the average linkage method with 2, 3, and 4 clusters, looking for the optimal number of clusters.
data = matrix
# Scatter plots for n=2 and linkage=average
cls = AgglomerativeClustering(n_clusters=2, linkage='average')
cls.fit(data)
hac_labels_2 = cls.labels_
# Scatter plots for n=3 and linkage=average
cls = AgglomerativeClustering(n_clusters=3, linkage='average')
cls.fit(data)
hac_labels_3 = cls.labels_
# Scatter plots for n=4 and linkage=average
cls = AgglomerativeClustering(n_clusters=4, linkage='average')
cls.fit(data)
hac_labels_4 = cls.labels_
fig = plt.figure(figsize=(12,8))
title = ['Clusters=2','Clusters=3','Clusters=4']
for i,l in enumerate([hac_labels_2,hac_labels_3, hac_labels_4]):
plt.subplot(3,2,2*i+1)
plt.scatter(matrix[:, 0], matrix[:, 2]+np.random.random(matrix[:, 1].shape)/2, c=l, cmap='plasma', s=10, linewidths=0)
plt.xlabel('Monetary'), plt.ylabel('Frequency')
plt.grid()
plt.title(title[i])
plt.subplot(3,2,2*i+2)
plt.scatter(matrix[:, 0], matrix[:, 1]+np.random.random(matrix[:, 1].shape)/2, c=l, cmap='plasma', s=10, linewidths=0)
plt.xlabel('Monetary'), plt.ylabel('Recency')
plt.grid()
plt.title(title[i])
plt.tight_layout()
plt.show()
Reviewing the above graphs, the 3-cluster segmentation is the best of the three, as it illustrates the clearest division. Still, we do not prefer the average linkage method here, since none of these plots shows good customer segmentation. We may get better segmentation with the ward method.
In the next step, we switch the linkage method to ward and visualize cluster counts from 2 to 6 to see whether we can get better customer segmentation.
# Scatter plots for n=2 and linkage=ward
cls = AgglomerativeClustering(n_clusters=2, linkage='ward')
cls.fit(data)
hac_labels_2 = cls.labels_
# Scatter plots for n=3 and linkage=ward
cls = AgglomerativeClustering(n_clusters=3, linkage='ward')
cls.fit(data)
hac_labels_3 = cls.labels_
# Scatter plots for n=4 and linkage=ward
cls = AgglomerativeClustering(n_clusters=4, linkage='ward')
cls.fit(data)
hac_labels_4 = cls.labels_
# Scatter plots for n=5 and linkage=ward
cls = AgglomerativeClustering(n_clusters=5, linkage='ward')
cls.fit(data)
hac_labels_5 = cls.labels_
# Scatter plots for n=6 and linkage=ward
cls = AgglomerativeClustering(n_clusters=6, linkage='ward')
cls.fit(data)
hac_labels_6 = cls.labels_
fig = plt.figure(figsize=(12,20))
title = ['Clusters=2','Clusters=3','Clusters=4', 'Clusters=5','Clusters=6']
for i,l in enumerate([hac_labels_2,hac_labels_3, hac_labels_4, hac_labels_5, hac_labels_6]):
plt.subplot(6,2,2*i+1)
plt.scatter(matrix[:, 0], matrix[:, 2]+np.random.random(matrix[:, 1].shape)/2, c=l, cmap='plasma', s=20, linewidths=0)
plt.xlabel('Monetary'), plt.ylabel('Frequency')
plt.grid()
plt.title(title[i])
plt.subplot(6,2,2*i+2)
plt.scatter(matrix[:, 0], matrix[:, 1]+np.random.random(matrix[:, 1].shape)/2, c=l, cmap='plasma', s=20, linewidths=0)
plt.xlabel('Monetary'), plt.ylabel('Recency')
plt.grid()
plt.title(title[i])
plt.tight_layout()
plt.show()
We see good segmentation with 3 clusters, and the ward method clearly produces superior segmentation. The outliers on the monetary axis are not separated out until we reach 6 clusters, which did not appear to be an effective cluster count based on our earlier silhouette analysis.
colors = ['purple','orange','lightblue','red','gray','green','fuchsia','blue','violet','gold',
'firebrick','olive','lime','navy','lightpink']
lbls = ['Valued Customers','Average Customers','Irregular Customers','forth','fifth','sixth','seventh']
db = DBSCAN(eps=0.33, min_samples=7).fit(data)
# core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
# core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
print("Percent of outliers: ",str(round((len(labels[db.labels_==-1])/len(labels))*100,2)),"%")
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
fig, axarr = plt.subplots(2, 1,figsize = (16,10))
plt.suptitle('DBScan plot of best estimate', size = 20)
for ind, x in enumerate(set(labels)):
if x == -1:
axarr[0].scatter(data[labels==-1,0],data[labels==-1, 2],c='black', cmap=plt.cm.rainbow, s=5, linewidths=0, label='outlier')
else:
axarr[0].scatter(x=data[labels==x,0],y=data[labels==x, 2],c=colors[ind], cmap=plt.cm.rainbow, s=25, linewidths=0)
# plt.legend()
# plt.show()
for ind, x in enumerate(set(labels)):
if x == -1:
axarr[1].scatter(data[labels==-1,0],data[labels==-1, 1],c='black', cmap=plt.cm.rainbow, s=5, linewidths=0, label='outlier')
else:
axarr[1].scatter(data[labels==x,0],data[labels==x, 1],c=colors[ind], cmap=plt.cm.rainbow, s=25, linewidths=0)
axarr.flat[0].set(xlabel='Monetary', ylabel='Frequency')
axarr.flat[1].set(xlabel='Monetary', ylabel='Recency')
plt.show()
Here we do an initial clustering with DBSCAN. We find that the initial model has very few outliers, but the number of clusters doesn't align with previous observations. We will run a silhouette test to find the configuration that gives us the best score.
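Before the grid search, a quick sketch of why eps matters (synthetic blobs with hypothetical values, min_samples held fixed): small radii fragment the data and flag many outliers, while large radii merge everything together.

```python
# Vary DBSCAN's eps radius on synthetic blobs and watch how the cluster and
# outlier counts respond; points labeled -1 are treated as noise.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.4, random_state=0)

for eps in [0.1, 0.5, 2.0]:
    labels = DBSCAN(eps=eps, min_samples=5).fit(X).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_outliers = int((labels == -1).sum())
    print(f"eps={eps}: {n_clusters} clusters, {n_outliers} outliers")
```

The silhouette search that follows simply automates this sweep over both eps and min_samples.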
best_score = -1
best_eps = 0
best_samples = 0
for eps_n in np.linspace(.25,.75,10):
for samples_n in np.linspace(15,50,20):
db = DBSCAN(eps=eps_n, min_samples=samples_n).fit(data)
# db = DBSCAN(eps=eps_n, min_samples=samples_n).fit_predict(data)
# db.fit(matrix)
# db.
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
#print(n_clusters_, eps_n, samples_n)
if n_clusters_ > 2:
silhouette_avg = round(silhouette_score(matrix, labels),2)
if(silhouette_avg>best_score):
best_score = silhouette_avg
best_eps = eps_n
best_samples = samples_n
print("For n_clusters =", n_clusters_, "The average silhouette_score is :", silhouette_avg)
print("For eps =", round(eps_n,3), "min_samples :", round(samples_n,3))
db = DBSCAN(eps=best_eps, min_samples=best_samples).fit(data)
labels = db.labels_
print("Percent of outliers: ",str(round((len(labels[db.labels_==-1])/len(labels))*100,2)),"%")
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
fig, axarr = plt.subplots(2, 1,figsize = (16,10))
plt.suptitle('DBScan plot of best estimate', size = 20)
for ind, x in enumerate(set(labels)):
if x == -1:
axarr[0].scatter(data[labels==-1,0],data[labels==-1, 2],c='black', cmap=plt.cm.rainbow, s=5, linewidths=0, label='outlier')
else:
axarr[0].scatter(x=data[labels==x,0],y=data[labels==x, 2],c=colors[ind], cmap=plt.cm.rainbow, s=25, linewidths=0, label=lbls[ind])
# plt.legend()
# plt.show()
for ind, x in enumerate(set(labels)):
if x == -1:
axarr[1].scatter(data[labels==-1,0],data[labels==-1, 1],c='black', cmap=plt.cm.rainbow, s=5, linewidths=0, label='outlier')
else:
axarr[1].scatter(data[labels==x,0],data[labels==x, 1],c=colors[ind], cmap=plt.cm.rainbow, s=25, linewidths=0, label=lbls[ind])
axarr.flat[0].set(xlabel='Monetary', ylabel='Frequency')
axarr.flat[1].set(xlabel='Monetary', ylabel='Recency')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
After running the silhouette test, we find that the best score gives us 3 clusters and reduces the percentage of outliers to just under 7%. This seems like a better fit, especially given that it is the optimal silhouette score.
y_kmeans = db.labels_
#visualising the clusters
t1 = getTrace(matrix[y_kmeans == -1, 0], matrix[y_kmeans == -1, 1], matrix[y_kmeans == -1, 2], s= 2, c='black', label = '-1: Outliers')
t2 = getTrace(matrix[y_kmeans == 0, 0], matrix[y_kmeans == 0, 1], matrix[y_kmeans == 0, 2], s= 6, c='purple', label = '0: Valued Customers')
t3 = getTrace(matrix[y_kmeans == 1, 0], matrix[y_kmeans == 1, 1], matrix[y_kmeans == 1, 2], s= 6, c='orange', label = '1: Average Customers')
t4 = getTrace(matrix[y_kmeans == 2, 0], matrix[y_kmeans == 2, 1], matrix[y_kmeans == 2, 2], s= 6, c='lightblue', label = '2: Irregular Customers')
x=matrix[:,0]
y=matrix[:,1]
z=matrix[:,2]
showGraph("Three Customer Segmentations of DBScan", "Monetary", [min(x),max(x)], "Frequency", [min(y),max(y)], "Recency", [min(z)-1,max(z)], [t1,t2,t3,t4])
display(Image('https://i.imgur.com/FsVkTkb.png', width=950, unconfined=True))
# Combined clusters for all clustering methods (DBSCAN, HAC, K-Means).
data = matrix
cls = DBSCAN(eps=best_eps, min_samples=best_samples)
cls.fit(data)
dbs_labels = cls.labels_
cls = AgglomerativeClustering(n_clusters=3, linkage='ward')
cls.fit(data)
hac_labels = cls.labels_
cls = KMeans(n_clusters=3, random_state=1)
cls.fit(data)
kmn_labels = cls.labels_
fig = plt.figure(figsize=(12,8))
title = ['DBSCAN','HAC','KMEANS']
for i,l in enumerate([dbs_labels, hac_labels,kmn_labels]):
plt.subplot(3,2,2*i+1)
plt.scatter(data[:, 0], data[:, 2]+np.random.random(data[:, 1].shape)/2, c=l, cmap=plt.cm.rainbow, s=20, linewidths=0)
plt.xlabel('Monetary'), plt.ylabel('Frequency')
plt.grid()
plt.title(title[i])
plt.subplot(3,2,2*i+2)
plt.scatter(data[:, 0], data[:, 1]+np.random.random(data[:, 1].shape)/2, c=l, cmap=plt.cm.rainbow, s=20, linewidths=0)
plt.xlabel('Monetary'), plt.ylabel('Recency')
plt.grid()
plt.title(title[i])
plt.tight_layout()
plt.show()
Overall Ramifications Summary (with DBSCAN and HAC): These models give us the ability to systematically craft highly targeted marketing campaigns for different customer groups. They aid in customer relationship building and customer conversion, not just from non-customer to customer but from one customer type to another. We can help the sales team strategically approach customers with the correct assumptions, and also decrease the negative reactions that come from misidentified targeting. This model serves as a launching point for historical analysis and should not be used for prospects. We believe all three models produced similar clustering segmentations of customers due to the decreased variability of having only 3 features. However, HAC and DBSCAN actually identified customers with low or zero recency, moderate frequency, and moderate monetary value, whereas we didn't see that with K-Means. We think HAC and DBSCAN both identified customers who can be targeted to purchase products but who otherwise haven't purchased anything recently.
We used three different clustering algorithms to cluster the RFM data, using the silhouette score to choose the best cluster size within each algorithm. First, we looked at the K-Means clustering algorithm. The silhouette score for 2 clusters was .422 and the score for 3 clusters was .416, a very small difference of .006; the scores for 4 clusters and above fell off. When we did a visual inspection of the cluster charts, the 3D charts showed good evidence for 3 clusters. The Hierarchical Agglomerative Clustering (HAC) algorithm's silhouette score for 2 clusters was .746 and the score for 3 was .684. The dendrogram also gave good evidence to support 3 clusters, and visual inspection of the two HAC charts provided further evidence that 3 clusters describe our data very well. Lastly, we utilized the DBSCAN algorithm; its best silhouette score also came at 3 clusters, and its charts looked similar to the HAC charts. In each case the Monetary vs. Recency charts still clustered well above 3 clusters, but when viewed in conjunction with the Monetary vs. Frequency charts, three clusters grouped the data best.
Noteworthy is how the 3D charts for K-Means and DBSCAN helped us understand how the data looked. This turned out to be a natural fit, since we were comparing three different features; understanding the data from a 3D perspective helped us better interpret what the two different 2D charts were showing us.
This was just an interesting experiment to see what happens to our customer segmentation if we use alpha shapes. According to MATLAB's documentation, "The alpha shape of a set of points is a generalization of the convex hull and a subgraph of the Delaunay triangulation. That is, the convex hull is just one type of alpha shape, and the full family of alpha shapes can be derived from the Delaunay triangulation of a given point set."
#Coding Source: https://plot.ly/python/3d-point-clustering/#reference
#MatLab Source: https://www.mathworks.com/examples/matlab/mw/matlab-ex67295599-alpha-shapes
import pandas as pd
import plotly
df = test
df.head() #segmentation.drop(['R_Quartile','F_Quartile','M_Quartile'
scatter = dict(
mode = "markers",
name = "y",
type = "scatter3d",
x = df['Monetary'], y = df['Recency'], z = df['Frequency'],
marker = dict( size=4, color="rgb(23, 190, 207)" )
)
clusters = dict(
alphahull = 7,
name = "y",
opacity = 0.1,
type = "mesh3d",
x = df['Monetary'], y = df['Recency'], z = df['Frequency']
)
layout = dict(
title = '3d point clustering',
scene = dict(
xaxis = dict( zeroline=False ),
yaxis = dict( zeroline=False ),
zaxis = dict( zeroline=False ),
)
)
fig = dict(data=[scatter, clusters], layout=layout)
plotly.offline.plot(fig, auto_open=True)
#The output is an html file. We add the screenshot below.
display(Image('https://i.imgur.com/XR2xuPa.jpg', width=900, unconfined=True))
First, with association rules, we have to create the rules. We are using a subset of the entire dataset containing only purchase transactions. We are also using a copy of the original dataset that includes all transactions where the Customer ID was null. We are not relying on CustomerID for this analysis; rather, we will find associations among the items purchased, so we would rather keep these rows, as they will help with our analysis.
from apyori import apriori
marketing_data_with_ID = marketing_data_copy.copy()
marketing_data_with_ID.drop_duplicates(inplace = True)
# Create the second data set that retains all the Customer ID for the Association Rules
# Fill missing CustomerID values with a per-country placeholder ID:
marketing_data.dropna(subset=['Description'], inplace=True)
country_map = {'Bahrain':999991, 'EIRE':999992, 'France':999993, 'Hong Kong':999994, 'Israel':999995,
'Portugal':999996, 'Spain':999997, 'Switzerland':999998, 'United Kingdom':999999, 'Unspecified':999990}
marketing_data_with_ID['CustomerID'] = np.where(marketing_data_with_ID.CustomerID.isnull(),marketing_data_with_ID.Country.map(country_map),
marketing_data_with_ID.CustomerID)
for ind,el in enumerate(marketing_data_with_ID['Description']):
if type(el) != str:
marketing_data_with_ID.loc[ind,'Description']='Unknown'
elif re.match('.*[a-zA-Z]+.*',el) is not None:
pass
else:
marketing_data_with_ID.loc[ind,'Description']='Unknown'
# Create the TotalPrice feature
marketing_data_with_ID['TotalPrice'] = marketing_data_with_ID['Quantity']*marketing_data_with_ID['UnitPrice']
by_invoice_with_ID = marketing_data_with_ID.groupby(['InvoiceNo', 'InvoiceDate','Country','CustomerID'], as_index=False)[ 'UnitPrice','Description','Quantity','TotalPrice'].agg(lambda x: list(x))
for ind,el in enumerate(by_invoice_with_ID['Description']):
if type(el[0]) != str:
by_invoice_with_ID.drop(ind,inplace=True)
by_invoice_with_ID['TotalPrice'] = by_invoice_with_ID['TotalPrice'].apply(sum)
by_invoice_with_ID['QuantityTotal'] = by_invoice_with_ID['Quantity'].apply(sum)
by_invoice_with_ID['Description_as_string'] = by_invoice_with_ID['Description'].apply(lambda x: "-".join(sorted(x)))
by_invoice_with_ID['Transaction']=''
for index,row in by_invoice_with_ID.iterrows():
if str(row['InvoiceNo']).startswith("C"):
by_invoice_with_ID.loc[index,'Transaction'] = 'Cancel'
elif str(row['InvoiceNo']).startswith("A"):
by_invoice_with_ID.loc[index,'Transaction'] = 'Adjust'
else:
by_invoice_with_ID.loc[index,'Transaction'] = 'Purchase'
sales_data = by_invoice_with_ID[by_invoice_with_ID.Transaction=="Purchase"]
sales_data = sales_data.copy()
sales_data.loc[:,'Average_per_item_cost'] = sales_data.apply(lambda x: x['TotalPrice'] / sum(x['Quantity']),axis=1)
sales_data.head(5)
timeseries_sales_data = sales_data.drop(['CustomerID','Country','Description','Transaction'],axis=1)
times = pd.to_datetime(sales_data.InvoiceDate)
timeseries_sales_data_group = timeseries_sales_data.groupby(times.dt.date)
timeseries_sales_data_group_avg = timeseries_sales_data_group.median()
width = 5
lag1 = timeseries_sales_data_group_avg['TotalPrice'].shift(1)
lag3 = timeseries_sales_data_group_avg['TotalPrice'].shift(width - 1)
window = lag3.rolling(window=width)
smooth_TotalPrice = window.mean()
smooth_TotalPrice.name='smooth_TotalPrice'
lag1 = timeseries_sales_data_group_avg['QuantityTotal'].shift(1)
lag3 = timeseries_sales_data_group_avg['QuantityTotal'].shift(width - 1)
window = lag3.rolling(window=width)
smooth_QuantityTotal = window.mean()
smooth_QuantityTotal.name='smooth_QuantityTotal'
timeseries_sales_data_group_avg = pd.concat([timeseries_sales_data_group_avg, smooth_TotalPrice, smooth_QuantityTotal],axis=1)
fig = plt.figure(figsize=(16,10))
plt.plot(timeseries_sales_data_group_avg.index,timeseries_sales_data_group_avg.smooth_QuantityTotal, label='Average Quantity');
plt.plot(timeseries_sales_data_group_avg.index,timeseries_sales_data_group_avg.smooth_TotalPrice, label='Average Total Price')
plt.title('Average sales stats per day',size=20)
plt.legend()
plt.show()
list_purchase_marketing_data_grouping_descriptions = []
for el in by_invoice_with_ID.loc[by_invoice_with_ID['Transaction']=='Purchase','Description']:
if len(el)>1:
list_purchase_marketing_data_grouping_descriptions.append(el)
purchase_rules = apriori(list_purchase_marketing_data_grouping_descriptions, min_support=0.005,min_confidence=.2,
max_confidence=.95,min_lift=0.3,max_lift=100,min_length=2, max_length=3 )
purchase_results = list(purchase_rules)
Markdown("""
## Apriori results
For the apriori algorithm we wanted to have a range of 2-3 transactions per relationship, meaning there would be 1-2 antecedents and 1 consequent.
We set the support low because we wanted to find as many relationships as possible to use in our analysis. In the end we found {num_n} relationships that meet the minimum criteria.
""".format(num_n = len(purchase_results)))
purchase_list = [list(x) for x in purchase_results]
antec = []
conseq = []
support = []
conf = []
lift = []
#Here we are creating a dataframe for each relationship from the apriori rule set
for elem in purchase_list:
antec.append(str(elem[2][0][0]))
conseq.append(str(elem[2][0][1]))
support.append(elem[1])
conf.append(elem[2][0][2])
lift.append(elem[2][0][3])
purchase_df = pd.DataFrame(data={'antecedants':antec,'consequents':conseq,'support':support,
'confidence':conf,'lift':lift})
#Here we are removing anything related to this data coming from frozensets
purchase_df['antecedants'] = purchase_df['antecedants'].apply(lambda x: x.replace(r"frozenset({'",""))
purchase_df['antecedants'] = purchase_df['antecedants'].apply(lambda x: x.replace(r'frozenset({"',""))
purchase_df['antecedants'] = purchase_df['antecedants'].apply(lambda x: x.replace(r'"})',''))
purchase_df['antecedants'] = purchase_df['antecedants'].apply(lambda x: x.replace(r"'})",''))
purchase_df['antecedants'] = purchase_df['antecedants'].apply(lambda x: x.replace(r"', '",' & '))
purchase_df['antecedants'] = purchase_df['antecedants'].apply(lambda x: x.replace(r'", "',' & '))
purchase_df['antecedants'] = purchase_df['antecedants'].apply(lambda x: x.strip())
purchase_df['consequents'] = purchase_df['consequents'].apply(lambda x: x.replace(r"frozenset({'",""))
purchase_df['consequents'] = purchase_df['consequents'].apply(lambda x: x.replace(r'frozenset({"',""))
purchase_df['consequents'] = purchase_df['consequents'].apply(lambda x: x.replace(r'"})',''))
purchase_df['consequents'] = purchase_df['consequents'].apply(lambda x: x.replace(r"'})",''))
purchase_df['consequents'] = purchase_df['consequents'].apply(lambda x: x.strip())
purchase_df = purchase_df.sort_values(by='support',ascending=False)
NBR_TOP_VALUES = 25
ants = []
cons = []
loop_round = 0
for ant in purchase_df['antecedants']:
if ant in ants:
pass
elif loop_round >= NBR_TOP_VALUES:
break
else:
ants.append(ant)
loop_round+=1
loop_round = 0
for con in purchase_df['consequents']:
if con in cons:
pass
elif loop_round >= NBR_TOP_VALUES:
break
else:
cons.append(con)
loop_round+=1
top_support_apriori_values = pd.DataFrame(index = ants, columns=cons, data = np.zeros([NBR_TOP_VALUES,NBR_TOP_VALUES]))
for index, rows in purchase_df.iterrows():
if rows['antecedants'] in top_support_apriori_values.index and rows['consequents'] in top_support_apriori_values.columns:
top_support_apriori_values.at[rows['antecedants'],rows['consequents']]=rows['support']
top_conf_apriori_values = pd.DataFrame(index = ants, columns=cons, data = np.zeros([NBR_TOP_VALUES,NBR_TOP_VALUES]))
for index, rows in purchase_df.iterrows():
if rows['antecedants'] in top_conf_apriori_values.index and rows['consequents'] in top_conf_apriori_values.columns:
top_conf_apriori_values.at[rows['antecedants'],rows['consequents']]=rows['confidence']
top_lift_apriori_values = pd.DataFrame(index = ants, columns=cons, data = np.zeros([NBR_TOP_VALUES,NBR_TOP_VALUES]))
for index, rows in purchase_df.iterrows():
if rows['antecedants'] in top_lift_apriori_values.index and rows['consequents'] in top_lift_apriori_values.columns:
top_lift_apriori_values.at[rows['antecedants'],rows['consequents']]=rows['lift']
ant = purchase_df.antecedants.iloc[0]
con = purchase_df.consequents.iloc[0]
sup = purchase_df.support.iloc[0]
lift = purchase_df.lift.iloc[0]
conf = purchase_df.confidence.iloc[0]
Markdown("""
# Association Rules Rundown
With association rules (implemented with apriori in Python), we find associative relationships between antecedents and consequents.
It can be helpful to think of the relationship as an if-then statement: if you purchase **{ant}**, then there is a certain likelihood you will buy **{con}**.
There are a number of ways to measure the relationship between the antecedent and the consequent. First, there is support: the proportion of all transactions in which the pair appears. For **{ant}** and **{con}**, **{sup}%** of all transactions contain that pair.
Next, there is confidence: of all the transactions containing the antecedent, the share that also include the consequent. Here, **{conf}%** of the transactions with **{ant}** include **{con}**.
Last, there is lift, which indicates how much more likely the consequent is to appear when the antecedent is in the transaction: **{con}** is **{lift}** times more likely to be purchased when **{ant}** is purchased.
""".format(ant = ant, con=con, lift=round(lift,1), sup = round(sup*100,2), conf = round(conf*100,2)))
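The three metrics described above can also be computed directly from raw transactions. A minimal sketch with made-up transactions (the item names are illustrative, not taken from the dataset):

```python
# Hypothetical mini transaction log (assumed data for illustration)
transactions = [
    {"JUMBO BAG PINK POLKADOT", "JUMBO BAG RED RETROSPOT"},
    {"JUMBO BAG RED RETROSPOT", "LUNCH BAG RED RETROSPOT"},
    {"JUMBO BAG PINK POLKADOT", "JUMBO BAG RED RETROSPOT", "LUNCH BAG RED RETROSPOT"},
    {"LUNCH BAG RED RETROSPOT"},
]

def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    n_ant = sum(antecedent in t for t in transactions)
    n_con = sum(consequent in t for t in transactions)
    n_both = sum(antecedent in t and consequent in t for t in transactions)
    support = n_both / n              # P(ant & con)
    confidence = n_both / n_ant       # P(con | ant)
    lift = confidence / (n_con / n)   # P(con | ant) / P(con)
    return support, confidence, lift

sup_, conf_, lift_ = rule_metrics(transactions,
                                  "JUMBO BAG PINK POLKADOT",
                                  "JUMBO BAG RED RETROSPOT")
# support = 2/4 = 0.5, confidence = 2/2 = 1.0, lift = 1.0 / (3/4) ≈ 1.33
```

Libraries such as mlxtend's apriori compute the same quantities over all frequent itemsets at once; the hand calculation just makes the definitions concrete.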
ax = sns.clustermap(top_conf_apriori_values, figsize=(16,10),cmap='RdBu');
ax.fig.suptitle('Clustermap of confidence of items with the most support',size = 28)
plt.setp(ax.ax_heatmap.xaxis.get_majorticklabels(), rotation=30, ha='right');
for ind_i, i in enumerate(ax.ax_heatmap.xaxis.get_ticklabels()):
    for ind_j, j in enumerate(ax.ax_heatmap.yaxis.get_ticklabels()):
        # Strip the Text(x, y, '...') wrapper that str() adds to tick labels
        i_text = re.sub(r"Text\W(\d*\.\d|\d),\d.[\"\']", '', str(i))
        i_text = re.sub(r"[\'\"]\)$", '', i_text)
        j_text = re.sub(r"Text\W(\d*\.\d|\d),\d.[\"\']", '', str(j))
        j_text = re.sub(r"[\'\"]\)$", '', j_text)
        text = ax.ax_heatmap.text(ind_j, ind_i, int(top_conf_apriori_values.loc[j_text, i_text]*100),
                                  ha="center", va="center", color="black", position=(ind_i+.5, ind_j+.5))
ax.ax_heatmap.set_xlabel('consequents', size = 20);
ax.ax_heatmap.set_ylabel('antecedants', size = 20);
ax = sns.clustermap(top_lift_apriori_values, figsize=(16,10),cmap='RdBu');
ax.fig.suptitle('Clustermap of lift of items with the most support',size = 28)
plt.setp(ax.ax_heatmap.xaxis.get_majorticklabels(), rotation=30, ha='right');
for ind_i, i in enumerate(ax.ax_heatmap.xaxis.get_ticklabels()):
    for ind_j, j in enumerate(ax.ax_heatmap.yaxis.get_ticklabels()):
        # Strip the Text(x, y, '...') wrapper that str() adds to tick labels
        i_text = re.sub(r"Text\W(\d*\.\d|\d),\d.[\"\']", '', str(i))
        i_text = re.sub(r"[\'\"]\)$", '', i_text)
        j_text = re.sub(r"Text\W(\d*\.\d|\d),\d.[\"\']", '', str(j))
        j_text = re.sub(r"[\'\"]\)$", '', j_text)
        text = ax.ax_heatmap.text(ind_j, ind_i, int(top_lift_apriori_values.loc[j_text, i_text]),
                                  ha="center", va="center", color="black", position=(ind_i+.5, ind_j+.5))
ax.ax_heatmap.set_xlabel('consequents', size = 20);
ax.ax_heatmap.set_ylabel('antecedants', size = 20);
ax = sns.clustermap(top_support_apriori_values, figsize=(16,10),cmap='RdBu');
ax.fig.suptitle('Clustermap of support of items with the most support',size = 28)
plt.setp(ax.ax_heatmap.xaxis.get_majorticklabels(), rotation=30, ha='right');
for ind_i, i in enumerate(ax.ax_heatmap.xaxis.get_ticklabels()):
    for ind_j, j in enumerate(ax.ax_heatmap.yaxis.get_ticklabels()):
        # Strip the Text(x, y, '...') wrapper that str() adds to tick labels
        i_text = re.sub(r"Text\W(\d*\.\d|\d),\d.[\"\']", '', str(i))
        i_text = re.sub(r"[\'\"]\)$", '', i_text)
        j_text = re.sub(r"Text\W(\d*\.\d|\d),\d.[\"\']", '', str(j))
        j_text = re.sub(r"[\'\"]\)$", '', j_text)
        text = ax.ax_heatmap.text(ind_j, ind_i, round(top_support_apriori_values.loc[j_text, i_text]*100, 1),
                                  ha="center", va="center", color="black", position=(ind_i+.5, ind_j+.5))
ax.ax_heatmap.set_xlabel('consequents', size = 20);
ax.ax_heatmap.set_ylabel('antecedants', size = 20);
These antecedents and consequents were obtained by sorting on support, meaning the items selected represent the most common relationships in the transactions. The items are mostly everyday consumer products such as lunch bags, jumbo bags, alarm clocks, and teacup-and-saucer combinations.
It is interesting to see the differences between what provides the greatest lift and what the most popular pairings are. The Jumbo Bag Pink Polkadot is most often purchased with the Jumbo Bag Retrospot, appearing together in over 4% of transactions, while the Green Regency Teacup and Saucer gives the biggest lift to the Pink Regency Teacup and Saucer.
The cluster maps help to show that distinct groups form from these associations.
most_common_conf_apriori_values = pd.DataFrame(index=purchase_df['antecedants'].value_counts()[:NBR_TOP_VALUES].index,
                                               columns=purchase_df['consequents'].value_counts()[:NBR_TOP_VALUES].index,
                                               data=np.zeros([NBR_TOP_VALUES, NBR_TOP_VALUES]))
for index, rows in purchase_df.iterrows():
    if rows['antecedants'] in most_common_conf_apriori_values.index and rows['consequents'] in most_common_conf_apriori_values.columns:
        most_common_conf_apriori_values.at[rows['antecedants'], rows['consequents']] = rows['confidence']
most_common_lift_apriori_values = pd.DataFrame(index=purchase_df['antecedants'].value_counts()[:NBR_TOP_VALUES].index,
                                               columns=purchase_df['consequents'].value_counts()[:NBR_TOP_VALUES].index,
                                               data=np.zeros([NBR_TOP_VALUES, NBR_TOP_VALUES]))
for index, rows in purchase_df.iterrows():
    if rows['antecedants'] in most_common_lift_apriori_values.index and rows['consequents'] in most_common_lift_apriori_values.columns:
        most_common_lift_apriori_values.at[rows['antecedants'], rows['consequents']] = rows['lift']
most_common_support_apriori_values = pd.DataFrame(index=purchase_df['antecedants'].value_counts()[:NBR_TOP_VALUES].index,
                                                  columns=purchase_df['consequents'].value_counts()[:NBR_TOP_VALUES].index,
                                                  data=np.zeros([NBR_TOP_VALUES, NBR_TOP_VALUES]))
for index, rows in purchase_df.iterrows():
    if rows['antecedants'] in most_common_support_apriori_values.index and rows['consequents'] in most_common_support_apriori_values.columns:
        most_common_support_apriori_values.at[rows['antecedants'], rows['consequents']] = rows['support']
ax = sns.clustermap(most_common_conf_apriori_values, figsize=(16,10),cmap='RdBu');
ax.fig.suptitle('Clustermap of confidence of most common descriptions',size = 28)
plt.setp(ax.ax_heatmap.xaxis.get_majorticklabels(), rotation=30, ha='right');
for ind_i, i in enumerate(ax.ax_heatmap.xaxis.get_ticklabels()):
    for ind_j, j in enumerate(ax.ax_heatmap.yaxis.get_ticklabels()):
        # Strip the Text(x, y, '...') wrapper that str() adds to tick labels
        i_text = re.sub(r"Text\W(\d*\.\d|\d),\d.[\"\']", '', str(i))
        i_text = re.sub(r"[\'\"]\)$", '', i_text)
        j_text = re.sub(r"Text\W(\d*\.\d|\d),\d.[\"\']", '', str(j))
        j_text = re.sub(r"[\'\"]\)$", '', j_text)
        text = ax.ax_heatmap.text(ind_j, ind_i, int(most_common_conf_apriori_values.loc[j_text, i_text]*100),
                                  ha="center", va="center", color="black", position=(ind_i+.5, ind_j+.5))
ax.ax_heatmap.set_xlabel('consequents', size = 20);
ax.ax_heatmap.set_ylabel('antecedants', size = 20);
ax = sns.clustermap(most_common_lift_apriori_values, figsize=(16,10),cmap='RdBu');
ax.fig.suptitle('Clustermap of lift of most common descriptions',size = 28)
plt.setp(ax.ax_heatmap.xaxis.get_majorticklabels(), rotation=30, ha='right');
for ind_i, i in enumerate(ax.ax_heatmap.xaxis.get_ticklabels()):
    for ind_j, j in enumerate(ax.ax_heatmap.yaxis.get_ticklabels()):
        # Strip the Text(x, y, '...') wrapper that str() adds to tick labels
        i_text = re.sub(r"Text\W(\d*\.\d|\d),\d.[\"\']", '', str(i))
        i_text = re.sub(r"[\'\"]\)$", '', i_text)
        j_text = re.sub(r"Text\W(\d*\.\d|\d),\d.[\"\']", '', str(j))
        j_text = re.sub(r"[\'\"]\)$", '', j_text)
        text = ax.ax_heatmap.text(ind_j, ind_i, int(most_common_lift_apriori_values.loc[j_text, i_text]),
                                  ha="center", va="center", color="black", position=(ind_i+.5, ind_j+.5))
ax.ax_heatmap.set_xlabel('consequents', size = 20);
ax.ax_heatmap.set_ylabel('antecedants', size = 20);
ax = sns.clustermap(most_common_support_apriori_values, figsize=(16,10),cmap='RdBu');
ax.fig.suptitle('Clustermap of support of top common descriptions',size = 28)
plt.setp(ax.ax_heatmap.xaxis.get_majorticklabels(), rotation=30, ha='right');
for ind_i, i in enumerate(ax.ax_heatmap.xaxis.get_ticklabels()):
    for ind_j, j in enumerate(ax.ax_heatmap.yaxis.get_ticklabels()):
        # Strip the Text(x, y, '...') wrapper that str() adds to tick labels
        i_text = re.sub(r"Text\W(\d*\.\d|\d),\d.[\"\']", '', str(i))
        i_text = re.sub(r"[\'\"]\)$", '', i_text)
        j_text = re.sub(r"Text\W(\d*\.\d|\d),\d.[\"\']", '', str(j))
        j_text = re.sub(r"[\'\"]\)$", '', j_text)
        text = ax.ax_heatmap.text(ind_j, ind_i, round(most_common_support_apriori_values.loc[j_text, i_text]*100, 1),
                                  ha="center", va="center", color="black", position=(ind_i+.5, ind_j+.5))
ax.ax_heatmap.set_xlabel('consequents', size = 20);
ax.ax_heatmap.set_ylabel('antecedants', size = 20);
The graph above shows the items that appear most frequently among the antecedents and consequents, along with the relative confidence for each pair.
This graph is interesting because a few distinct sections of relative confidence emerge; for this chart, confidence between consequent and antecedent ranges from roughly 0.30 to 1. We also see a lot of DOTCOM POSTAGE among the antecedents. The time-series graph of sales patterns shows increased sales, driven by increased units sold, toward the end of the year, which could be the result of purchases made in anticipation of the holidays.
With increasing focus on customer engagement, management, and service, the models we’ve developed represent a growing sector in the marketing and business analytics space. We found several software companies selling the RFM methodology encapsulated in a clustering or classification technique, including Putler, Optimove, Canopy Labs, and Clever Analytics. The need and demand are definitely there.
Given the data set we had, our performance was above average. One improvement for obtaining tighter cluster groups would be to apply our model to each country in isolation: as we illustrated in the Data Understanding section, geography played a role, directly or indirectly, in the frequency and volume of sales. Regardless, we wanted to test our model’s limits, and it still generated numbers that can actually be used for customer segmentation.
Going forward, we will not discuss hypothetical deployment by “other parties” or companies, because the model we produced will actually be deployed on 8/14/2018 by Murtada Shubbar, a member of this team. Murtada works in the Pricing and Profitability department of a billion-dollar-plus distribution company with over 80,000 customers. Surprisingly, the only customer segmentation that currently exists at that company is as follows:
display(Image('https://i.imgur.com/noJ1n6M.png', width=450, unconfined=True))
display(Image('https://i.imgur.com/RLe7IPQ.png', width=450, unconfined=True))
Customer Level represents a static customer assignment based on what Distribution Center (DC) managers think the customer brings into the company. Customer Industry represents the main focus of what the customer sells, although most customers overlap in more than one area. From this combination, the department determines which customer gets which price point. In theory, because there are 5 different price points for each item sold, the customer that brings in the most sales receives the most discounted price, and so on.
Several problems arise with this methodology:
The DC manager is biased: often, to increase sales, they change a customer’s level attributes to give that customer better pricing.
The customer segmentation is static; the only way to reclassify all the customers is to have the DC manager go through them again.
The customer industry classification plays no role in the methodology for deciding which customer gets which price.
Thus, the weakness of the customer classification system currently deployed at Murtada’s company is exposed: only sales volume is used, hence only one of our variables, Monetary.
As we have indicated, the model we’re using for customer segmentation is more dynamic, objective, and insightful; it gives a more holistic and historically more accurate picture of the customers. Furthermore, by exploring the three-dimensional nature of the customer, we’ve suggested researched action steps for the sales team, such as implementing more aggressive pricing incentives or beneficially letting go of “lost customers.”
Murtada’s director plans to run a pilot with the new customer segmentations produced to test if they have increased conversion and revenue rates. This ultimately will also decrease marketing costs, since the department will not target low yielding customers.
Additional data is not needed for this model; however, an additional variable would help elevate the customer segmentation process to a new level. The new feature added to RFM would be ‘V’, which stands for velocity: how fast is the customer growing, and what is their trajectory? This adds an additional dimension to the analysis: RFMV. The model will need to be updated twice a year, once in the winter season and once in the summer season, although other companies can update it as often as they want.
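One way the hypothetical ‘V’ feature could be computed is as the slope of a least-squares line through each customer’s monthly spend. A minimal sketch with an assumed per-customer monthly spend table (the data and column names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical per-customer monthly spend (assumed data for illustration)
monthly_spend = pd.DataFrame({
    "CustomerID": [1001]*6 + [1002]*6,
    "Month":      list(range(6)) * 2,
    "Spend":      [100, 120, 140, 160, 180, 200,   # steadily growing customer
                   300, 280, 260, 240, 220, 200],  # declining customer
})

def velocity(group):
    """Slope of a least-squares fit through monthly spend: growth rate per month."""
    slope, _ = np.polyfit(group["Month"], group["Spend"], 1)
    return slope

v = monthly_spend.groupby("CustomerID").apply(velocity)
# Customer 1001 grows ~+20/month; customer 1002 shrinks ~-20/month
```

This velocity column could then be appended to the RFM table and fed into the same clustering pipeline as the other three variables.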
$\textbf{3D Modeling}$ The additional 3D modeling demonstrates the relationship between the 3 primary metrics for customer behavior. The use of 3D modeling better shows the distinct clusters and how effective the different clustering techniques are.
$\textbf{Association Rules Modeling}$ The addition of association rule modeling shows the buying patterns associated with the items that are purchased. The three primary metrics associated with association rules (lift, support, and confidence) were mapped out using clustermaps.
$\textbf{Timeseries graphs}$ Purchasing patterns over time were displayed in a graph to show how buying behavior changes over the year. This helped reinforce some of the observations about the types of items purchased found in the association rule modeling.
$\textbf{Grouping of cancelled orders}$ We attempted to join the cancelled orders with the purchases but found that only a small share of transactions matched: about 10% of the cancelled orders could be confidently matched to a purchase invoice. We matched on description, price, invoice order (any purchase happening after a cancelled order was dropped), and quantity. With such a low matching rate, we felt there was insufficient data for any proper analysis.
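A rough sketch of this matching approach, using a simplified hypothetical transaction log (the column names mirror the Kaggle dataset, but the invoice-ordering constraint is omitted here):

```python
import pandas as pd

# Hypothetical simplified transaction log; cancellations carry a 'C' invoice prefix
df = pd.DataFrame({
    "InvoiceNo":   ["536365", "C536366", "536367"],
    "Description": ["RED MUG", "RED MUG", "BLUE MUG"],
    "UnitPrice":   [2.55, 2.55, 3.39],
    "Quantity":    [6, -6, 4],
    "CustomerID":  [17850, 17850, 13047],
})

cancels = df[df["InvoiceNo"].str.startswith("C")].copy()
purchases = df[~df["InvoiceNo"].str.startswith("C")].copy()
cancels["Quantity"] = -cancels["Quantity"]  # make quantities comparable

# Match each cancellation to a purchase of the same item, price, quantity and customer
matched = cancels.merge(purchases,
                        on=["CustomerID", "Description", "UnitPrice", "Quantity"],
                        suffixes=("_cancel", "_purchase"))
# Here the single cancellation C536366 matches purchase 536365
```

With the real data, the low match rate likely reflects partial returns and cancellations whose originating purchase falls outside the two-year window.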
$\textbf{3D Modeling using Alpha Shapes}$ An experimental modeling technique that did not result in any additional insight.